A scaling law beyond Zipf's law and its relation with Heaps' law



Francesc Font-Clos 
Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra,
E-08193 Bellaterra (Barcelona), Spain and
Departament de Matemàtiques, Universitat Autònoma de Barcelona,
Edifici C, E-08193 Bellaterra (Barcelona), Spain

Gemma Boleda 

Department of Linguistics, The University of Texas at Austin, 
1 University Station B5100, Austin, TX, USA 

Álvaro Corral

Centre de Recerca Matemàtica, Edifici C,
Campus Bellaterra, E-08193 Bellaterra (Barcelona), Spain

The dependence on text length of the statistical properties of word occurrences has long
been considered a severe limitation for the usefulness of quantitative linguistics. We propose 
a simple scaling form for the distribution of absolute word frequencies which uncovers the 
robustness of this distribution as text grows. In this way, the shape of the distribution is 
always the same and it is only a scale parameter which increases linearly with text length. 
By analyzing very long novels we show that this behavior holds both for raw, unlemmatized 
texts and for lemmatized texts. For the latter case, the word-frequency distribution is well fit 
by a double power law, maintaining the Zipf exponent value $\gamma \simeq 2$ for large frequencies but
yielding a smaller exponent in the low frequency regime. The growth of the distribution with 
text length allows us to estimate the size of the vocabulary at each step and to propose an 
alternative to Heaps' law, which turns out to be intimately connected to Zipf's law, thanks 
to the scaling behavior. 

I. INTRODUCTION 



Zipf's law is perhaps one of the best pieces of evidence for the existence of universal physical-like
laws in cognitive science and in the social sciences. Classic examples include the population of cities,
the assets of companies, and the frequency of words in texts or speech. Taking the latter case,
the law is obtained directly by counting the number of repetitions, i.e., the absolute frequency
$n$, of all words in a long enough text, and assigning increasing ranks, $r = 1, 2, \dots$, to decreasing
frequencies. If a power-law relation

$$ n \propto \frac{1}{r^{\beta}} $$

holds for a large enough range, with exponent $\beta$ more or less close to 1, then Zipf's law is
considered to be fulfilled (with $\propto$ denoting proportionality). An equivalent formulation of the law is
obtained in terms of the probability distribution of the frequency $n$, which then plays the role of
a random variable, for which a power-law distribution

$$ D(n) \propto \frac{1}{n^{\gamma}} $$

should hold, with $\gamma = 1 + 1/\beta$ (taking values close to 2) and $D(n)$ the probability mass function
of $n$ (or the probability density of $n$, in a continuous approximation) [1-5]. Note that this picture
involves two levels of statistics: first counting words to get frequencies, and then counting
repetitions of frequencies to get the distribution of frequencies.
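
The two counting steps can be made concrete with a minimal Python sketch. It is not taken from the paper: the function names and the toy sentence are illustrative assumptions, and a real analysis would use a full tokenized book.

```python
from collections import Counter

def rank_frequency(tokens):
    """Absolute frequencies n sorted in decreasing order; list index + 1 is the rank r."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)

def frequency_distribution(tokens):
    """Return {n: D(n)}, the fraction of word types that occur exactly n times."""
    counts = Counter(tokens)                  # first statistics: word -> frequency n
    types_per_n = Counter(counts.values())    # second statistics: n -> number of types
    vocabulary = len(counts)
    return {n: m / vocabulary for n, m in sorted(types_per_n.items())}

# Toy text (hypothetical), only to show the shape of the two outputs.
tokens = "the cat saw the dog and the dog saw a bird".split()
print(rank_frequency(tokens))
print(frequency_distribution(tokens))
```

The first function gives the rank-frequency representation $n(r)$, the second the distribution of frequencies $D(n)$; both are built from the same word counts.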

We can recognize that the criteria for the validity of Zipf's law are rather vague (long enough
text, large enough range, exponent $\beta$ more or less close to 1). Generally, a long enough text means
a book, a large range can be a bit more than an order of magnitude, and the proximity of the
exponent $\beta$ to 1 translates into an interval (0.7, 1.2), or even beyond that [3, 6, 7]. Moreover,
rigorous methods have not usually been required for fitting the power-law distribution, with
linear regression in double-logarithmic scale being the most common one, either for $n(r)$ or for
$D(n)$. But it is well known that this procedure has severe drawbacks and can lead to flawed
results [8]. Nevertheless, once these limitations have been assumed, the fulfillment of Zipf's law in
linguistics is astonishing, being valid no matter the author, style, or language; the law
is universal, at least in a qualitative sense.

At a theoretical level, many different competing explanations of Zipf's law have arisen, such
as random (monkey) typing [9, 10], preferential repetitions or proportional growth, the
least effort principle, and, beyond linguistics, Boltzmann-type approaches [17], or even
avalanche dynamics in a critical system [18]; remarkably, most of these options have generated
considerable controversy. In any case, the power-law behavior is the hallmark of scale
invariance, i.e., the impossibility of defining a characteristic scale, either for frequencies or for ranks.
Although power laws are sometimes also referred to as scaling laws, we will make a precise
distinction here. In short, a scaling law is any function invariant under a scale transformation (which is a
linear dilation or contraction of the axes). In one dimension the only scaling law is the power law,
but this is not true with more than one variable [22]. Note that in text statistics, other variables
to consider in addition to frequency are the text length $L$ (the total number of words, or tokens)
and the size of the vocabulary $V_L$ (i.e., the number of different words, or types).

Somewhat related to Zipf's law is Heaps' law (also called Herdan's law [23]), which states
that the vocabulary $V_L$ grows as a function of the text length $L$ as a power law,

$$ V_L \propto L^{\alpha}, $$

with exponent $\alpha$ smaller than one. However, even simple log-log plots of $V_L$ versus $L$ do not show a
convincing linear behavior and therefore the evidence for this law is somewhat weak. Nevertheless,
a number of works have derived the relationship $\beta = 1/\alpha$ between Zipf's and Heaps' exponents
[25], at least in the infinite-system limit, using different assumptions.
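
The vocabulary growth curve $V_L$ that Heaps' law describes can be measured directly. The following is a minimal sketch under the assumption that the text is already available as a list of tokens; `vocabulary_growth` is a hypothetical helper name, not something defined in the paper.

```python
def vocabulary_growth(tokens):
    """Return a list whose entry L-1 is V_L, the number of distinct types in the first L tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)            # a repeated token leaves the vocabulary unchanged
        growth.append(len(seen))
    return growth

# Toy text (hypothetical); with a real book one would plot V_L against L in log-log scale.
tokens = "the cat saw the dog and the dog saw a bird".split()
print(vocabulary_growth(tokens))
```

Plotting the returned list against $L = 1, 2, \dots$ in double-logarithmic scale is the kind of plot in which a straight line would indicate Heaps' law.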

Despite the relevance of Zipf's law, and its possible relations with criticality, few systematic
studies of the dependence of the law on system size (i.e., text length) have been performed. But
it was Zipf himself [1, pp. 144] who first observed a variation in the exponent $\beta$ when the system
size was varied. In particular, "small" samples would give $\beta < 1$, while "big" ones yielded $\beta > 1$.
However, that was attributed to "undersampling" and "oversampling", as Zipf believed that there
was an optimum system size under which all words occurred in proportion to their theoretical
frequencies, i.e., those given by the exponent $\beta = 1$. This increase of $\beta$ with $L$ has been confirmed
later, see Refs. [23, 28], leading to the conclusion that the practical usefulness of Zipf's law is
rather limited [23]. More recently, using rather large collections of books from single authors,
Bernhardsson et al. [29] find a decrease of the exponents $\gamma$ and $\alpha$ with text length, in correspondence
with the increase in $\beta$ found by Zipf and others. They propose a size-dependent word-frequency
distribution based on three main assumptions:

(i) The vocabulary scales with text length as $V_L \propto L^{\alpha(L)}$, where the exponent $\alpha(L)$ itself depends
on the text length. Note that this is not an assumption in itself, just notation, and is also
equivalent to writing the average frequency $\langle n \rangle = L/V_L$ as $\langle n(L) \rangle \propto L^{1-\alpha(L)}$.

(ii) The maximum frequency is proportional to the text length, i.e., $n_{\max} = n(r = 1) \propto L$.

(iii) The functional form of the word-frequency distribution $D_L(n)$ is that of a power law with
an exponential tail, with both the scale parameter $c(L)$ and the power-law exponent $\gamma(L)$
depending on the text length $L$. That is,

$$ D_L(n) = A \, \frac{e^{-n/c(L)}}{n^{\gamma(L)}}, $$

with $1 < \gamma(L) < 2$.

Taking $c(L) = c_0 L$ guarantees that $n_{\max} \propto L$; moreover, the form of $D_L(n)$ implies that,
asymptotically, $\langle n(L) \rangle \propto L^{2-\gamma(L)}$ [22], which, comparing with condition (i), leads to

$$ \alpha(L) = \gamma(L) - 1, $$

so $0 < \alpha(L) < 1$. This relation between $\alpha$ and $\gamma$ is in agreement with previous results if $L$ is fixed
[2, 26, 27]. It was claimed in Ref. [29] that $\alpha(L)$ seems to decrease from 1 to 0 for increasing $L$
and therefore $\gamma(L)$ decreases from 2 to 1. The resulting functional form,

$$ D_L(n) = A \, \frac{e^{-n/(c_0 L)}}{n^{1+\alpha(L)}}, $$

is in fact the same appearing in many critical phenomena, where the power-law term is limited
by a characteristic value of the variable, $c_0 L$, arising from a deviation from criticality or from
finite-size effects [22, 30-32]. Note that this implies that the tail of the frequency distribution is
not a power law but an exponential, and therefore the frequency of the most common words is not
power-law distributed. This is in contrast with recent studies that have clearly established that
the tail of $D_L(n)$ is well modelled by a power law [8, 33]. But what is most uncommon about this
functional form is to have a "critical" exponent that depends on the system size; rather, the values
of exponents should not be influenced by external scales. So, we seek an alternative picture
that is in better agreement with typical scaling phenomena.

Our proposal is that, although the word-frequency distribution $D_L(n)$ changes with system
size $L$, the shape of the distribution is independent of $L$ and $V_L$, and it is only the scale of
$D_L(n)$ which changes with these variables. This implies that the shape parameters of $D_L(n)$ (in
particular, any exponent) do not change with $L$ and only one scale parameter changes with $L$,
increasing linearly. This is explained in the next section, while the third one is devoted to the
validation of our scaling form in real texts, using both plain words and their corresponding lemma
forms; in the latter case an alternative to Zipf's law can be proposed, consisting of a double
power-law distribution. Our findings for words and lemmas suggest that the previous observation that
Zipf's exponent depends on the text length [23, 28, 29] might be an artifact of the increasing
weight of a second regime in the distribution of frequencies beyond a certain text length. The fourth
section investigates the implications of our scaling approach for Heaps' law. Although the scaling
ansatz we propose has a counterpart in the rank-frequency representation, we prefer to illustrate
it in terms of the distribution of frequencies, as this approach has been found more appropriate
from a statistical point of view [8].

II. THE SCALING FORM OF THE WORD-FREQUENCY DISTRIBUTION

Let us come back to the rank-frequency relation, in which the absolute frequency $n$ of each
word is a function of its rank $r$. Defining the relative frequency as $x = n/L$ and inverting the
relationship, we can write

$$ r = G_L(x). $$

Note that here we are not assuming a power-law relationship between $r$ and $x$, just a generic
function $G_L$, which may depend on the text length $L$. Instead of the three hypotheses introduced
by Bernhardsson et al., we just need one hypothesis, which is the independence of the function $G_L$
from $L$; so

$$ r = G(n/L). $$

This turns out to be a scaling law, with $G(x)$ a scaling function. It means that if in the first 10,000
words of a book there are 5 words with relative frequency larger than or equal to 2%, that is,
$G(0.02) = 5$, then this will still be true for the first 20,000 words, and for the first 100,000, and for
the whole book. These words need not necessarily be the same ones, although in some cases they
might be. In fact, instead of assuming as in Ref. [29] that the frequency of the most used word
scales linearly with $L$, what we assume is just that this is true for all words, at least on average.
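
To make the hypothesis tangible, the empirical counterpart of $G_L(x)$ can be computed for the first $L$ tokens of a text and compared across several values of $L$. The sketch below is only illustrative: `G_empirical` is a hypothetical name, and the periodic toy text is an assumption chosen so that the relative frequencies are exactly stable.

```python
from collections import Counter

def G_empirical(tokens, L, x):
    """Number of word types whose relative frequency in the first L tokens is at least x."""
    counts = Counter(tokens[:L])
    return sum(1 for n in counts.values() if n / L >= x)

# Periodic toy text (hypothetical): relative frequencies are 1/2, 1/4 and 1/4 at every scale.
tokens = ["a", "a", "b", "c"] * 100
for L in (40, 200, 400):
    print(L, G_empirical(tokens, L, 0.5), G_empirical(tokens, L, 0.25))
```

For a real book, the scaling hypothesis states that the printed counts should stay approximately constant as $L$ grows, for any fixed threshold $x$ that is not too small.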

Now let us introduce the survivor function or complementary cumulative distribution function
$S_L(n)$ of the absolute frequency, defined in a text of length $L$ as $S_L(n) = \mathrm{Prob}[\text{frequency} \ge n]$.
Note that, estimating from empirical data, $S_L(n)$ turns out to be essentially the rank, but divided
by the total number of ranks, $V_L$, i.e., $S_L(n) = r/V_L$. Therefore, using our ansatz for $r$ we get

$$ S_L(n) = \frac{G(n/L)}{V_L}. $$

Within a continuous approximation the probability mass function of $n$, $D_L(n) = \mathrm{Prob}[\text{frequency} = n]$,
can be obtained from the derivative of $S_L(n)$,

$$ D_L(n) = -\frac{\partial S_L(n)}{\partial n} = \frac{g(n/L)}{L V_L}, \qquad (1) $$

where $g$ is minus the derivative of $G$, i.e., $g(x) = -G'(x)$. If one does not trust the continuous
approximation, one can write $D_L(n) = S_L(n) - S_L(n+1)$ and perform a Taylor expansion, for which
the result is the same, but with $g(x) \simeq -G'(x)$. In this way, we obtain simple forms for $S_L(n)$
and $D_L(n)$, which are analogous to standard scaling laws, except for the fact that we have not
specified how $V_L$ changes with $L$. If Heaps' law holds, $V_L \propto L^{\alpha}$, we recover a standard scaling law,
$D_L(n) \propto g(n/L)/L^{1+\alpha}$, which fulfills invariance under a scaling transformation, or, equivalently,
fulfills the definition of a generalized homogeneous function,

$$ D_{\lambda_L L}(\lambda_n n) = \lambda_D D_L(n), $$

where $\lambda_L$, $\lambda_n$, and $\lambda_D$ are the scale factors, related in this case through

$$ \lambda_n = \lambda_L = \lambda $$

and

$$ \lambda_D = \frac{1}{\lambda^{1+\alpha}}. $$

However, in general (if Heaps' law does not hold), the distribution $D_L(n)$ is still invariant under a
scale transformation but with a different relation for $\lambda_D$, which is

$$ \lambda_D = \frac{V_L}{\lambda V_{\lambda L}}. $$

So, $D_L(n)$ is not a generalized homogeneous function, but presents an even more general form.
In any case, the validity of the proposed scaling law, Eq. (1), can be checked by performing a very
simple rescaled plot, displaying $L V_L D_L(n)$ versus $n/L$. A resulting data collapse would be in favor
of the independence of the scaling function from $L$. This is undertaken in the next section.
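
The rescaled plot can be prepared with a few lines of Python. This is a minimal sketch, not the authors' code: `rescaled_density` is a hypothetical helper, and it uses the raw (unbinned) empirical estimate of $D_L(n)$, whereas a real analysis would probably bin the frequencies logarithmically before plotting.

```python
from collections import Counter

def rescaled_density(tokens, L):
    """Sorted (n/L, L * V_L * D_L(n)) pairs for the first L tokens of a text."""
    counts = Counter(tokens[:L])
    types_per_n = Counter(counts.values())   # n -> number of types with frequency n
    # Empirically D_L(n) = types_per_n[n] / V_L, so L * V_L * D_L(n) = L * types_per_n[n].
    return sorted((n / L, L * m) for n, m in types_per_n.items())

# Toy text (hypothetical), only to show the output format.
tokens = ["a", "a", "b", "c"] * 10
print(rescaled_density(tokens, 40))
```

Calling the function for several values of $L$ on the same book and overlaying the resulting point sets in log-log scale gives the data-collapse test described above.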



III. DATA ANALYSIS RESULTS 



To test the validity of our predictions, summarized in Eq. (1), we analyze a corpus of literary
texts, composed of seven large books in English, Spanish, and French (among them, some of the
longest novels ever written, in order to have as much statistics of homogeneous texts as possible). In
addition to performing the statistics of the words in the texts, we consider the lemmatized version
of each text, where each word is substituted by its corresponding lemma (roughly speaking, the
stem form of the word), and we consider the statistics of lemmas as well. Appendix A provides
detailed information on the lemmatization procedure, and Table I summarizes the most relevant
characteristics of each of the books.

First, we plot the distributions of word frequencies, $D_L(n)$ versus $n$, for each book, considering
either the whole book or the first $L/L_{tot}$ fraction, where $L_{tot}$ is the real, complete text length (i.e.,
if $L = L_{tot}/2$ we consider just the first half of the book; no average is performed over parts of

Title                     Author               Language  Year       L_tot      V_tot   L_tot^(l)  V_tot^(l)
Artamène                  Scudéry siblings     French    1649       2,078,437  25,161  1,737,556   5,008
Clarissa                  Samuel Richardson    English   1748         971,294  20,490    940,967   9,041
Don Quijote               Miguel de Cervantes  Spanish   1605-1615    390,436  21,180    378,664   7,432
La Regenta                L. Alas "Clarín"     Spanish   1884         316,358  21,870    309,861   9,900
Le Vicomte de Bragelonne  A. Dumas (father)    French    1847         693,947  25,775    676,252  10,744
Moby-Dick                 Herman Melville      English   1851         215,522  18,516    204,094   9,141
Ulysses                   James Joyce          English   1918         268,144  29,448    242,367  12,469

Table I: Total text length and vocabulary before ($L_{tot}$, $V_{tot}$) and after ($L_{tot}^{(l)}$, $V_{tot}^{(l)}$) the lemmatization process,
for all the books considered (including also their author, language, and publication year). The text length
for lemmas is shorter than for words because for a number of word tokens their corresponding lemma type
could not be determined, and they were ignored.



size $L$). For a fixed book, we observe that different values of $L$ lead to distributions with small but clear
differences, see Fig. 1. The pattern described by Bernhardsson et al. (equivalent to Zipf's findings
for the change of the exponent $\beta$) seems to hold, as the absolute value of the slope in log-log scale
(i.e., the apparent power-law exponent $\gamma$) decreases with increasing text length.

However, a scaling analysis reveals an alternative picture. As suggested by Eq. (1), plotting
$L V_L D_L(n)$ against $n/L$ for different values of $L$ yields a collapse of all the curves onto a unique
$L$-independent function, which represents the scaling function $g(x)$. Figure 2 shows this for the
same books and parts as in Fig. 1. The data collapse can be considered excellent, except for the
smallest frequencies. For the largest $L$ the collapse holds down to $n \simeq 3$ if we exclude La Regenta,
which only collapses for about $n \gtrsim 6$. So, our scaling hypothesis is validated, independently of
the particular shape that $g(x)$ takes. Note that $g(x)$ is independent of $L$ but not of the book,
i.e., each book has its own $g(x)$, different from the rest. In any case, we observe a slightly convex
shape in log-log space, which leads to the rejection of the power-law hypothesis for the whole range
of frequencies. Nevertheless, the data do not show any clear parametric functional form: a
double power law, a stretched exponential, a Weibull, or a lognormal tail could all be fit to the
distributions. This is not incompatible with the fact that the large-$n$ tail can be well fit by a power
law (Zipf's law) for more than 2 orders of magnitude [33].

Things turn out to be somewhat different after the lemmatization process. The scaling ansatz
is still clearly valid for the frequency distributions, see Fig. 3, but with a different kind of scaling
function $g(x)$, with a more defined characteristic shape, due to a more pronounced log-log curvature
or convexity. In fact, close comparison of the data leads us to conclude that the lemmatization process
enhances the goodness of the scaling approximation, especially in the low-frequency zone. It could
be reasoned that, as lemmatized texts have a significantly reduced vocabulary compared to the
original ones, but the total length remains essentially the same, they are somehow equivalent to
much longer texts, if one considers the length-to-vocabulary ratio. Although this matter needs to
be further investigated, it supports the idea that our main hypothesis, the scale invariance of the
distribution of frequencies, holds more strongly for longer texts.

[Figure 1 graphic not reproduced: six log-log panels, one per book: Clarissa (English), Ulysses (English),
Bragelonne (French), Regenta (Spanish), Artamène (French), and Quijote (Spanish). Each panel legend
lists the text length L and vocabulary size V of the six text fractions plotted.]

Figure 1: Density of word frequencies $D_L(n)$ (y-axis) against absolute frequency $n$ (x-axis), for six different
books, taking text length $L = L_{tot}/10, L_{tot}/10^{4/5}, L_{tot}/10^{3/5}, \dots, L_{tot}$. The slope seems to decrease with
text length.

[Figure 2 graphic not reproduced: six log-log panels for the same books and text fractions as in Fig. 1.]

Figure 2: Rescaled densities $L V_L D_L(n)$ (y-axis) against relative frequency $n/L$ (x-axis), for the same books
and fractions of text as in Fig. 1. As is apparent, the rescaled densities collapse onto a single function,
independently of the value of $L$, validating our proposed scaling form for $D_L(n)$ [Eq. (1)] and making clear
that the decrease of the log-log slope with $L$ is not a consequence of a genuine change in the scaling properties
of the distribution.

Due to the clear curvature of $g(x)$ in the lemmatized case, we go one step further and propose
a concrete function to fit these data, namely,

$$ g(x) = \frac{k}{x \, (a + x^{\gamma-1})}. \qquad (2) $$

This function has two free parameters, $a$ and $\gamma$ (with $\gamma > 1$ and $a > 0$), and behaves as a
double power law, that is, for large $x$, $g(x) \sim x^{-\gamma}$ (we still have Zipf's law), while for small $x$,
$g(x) \sim x^{-1}$. The transition point between both power-law tails is determined by $a$, and $k$ is fixed
by normalization. But an important issue is that it is not $g(x)$ which is normalized to one but
$D_L(n)$. We select a power law with exponent one for small $x$ for three reasons: first, in order to
explore an alternative to the power law in the $V_L$ versus $L$ relation (which is not clearly supported
by data, see next section); second, to compare better our results with those of Ref. [29]; and third,
to keep the number of parameters minimum. Thus, we do not look for the most accurate fit but
for the simplest description of the data. Although double power laws have been previously fit to
rank-frequency plots for unlemmatized corpora, the resulting exponents for large ranks
(low frequencies) are different than for our lemmatized texts.

Then, defining $n_a = a^{1/(\gamma-1)} L$, the corresponding word-frequency density turns out to be

$$ D_L(n) \propto \frac{1}{n \left[ 1 + (n/n_a)^{\gamma-1} \right]}, \qquad (3) $$

with $n_a$ the scale parameter (the scale parameter of $g(x)$ was $a^{1/(\gamma-1)}$). The data collapse in Fig. 3
and the good fit imply that the Zipf-like exponent $\gamma$ does not depend on $L$, but the transition
point between both power laws, $n_a$, obviously does. Hence, as $L$ grows the transition to the $\sim n^{-\gamma}$
regime occurs at higher absolute frequencies, given by $n_a$, but fixed relative frequencies, given
by $a^{1/(\gamma-1)}$. In Table II we report the fitted parameters for all seven books, obtained by maximum
likelihood estimation for the frequencies of the whole books, as well as Monte Carlo estimates of
their uncertainties. We have confirmed the stability of $\gamma$ by fitting only a power-law tail from a fixed
common relative frequency, for different values of $L$ [33].
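
The double power-law form and its crossover scale can be explored numerically. The sketch below is an illustration rather than the fitting code of the paper: the function names are hypothetical, the constant $k$ is set to 1 (it only shifts the curve vertically in log-log scale), and the example parameters are the values quoted for the lemmatized Artamène in Tables I and II.

```python
import math

def g_double_power_law(x, a, gamma, k=1.0):
    """Scaling function g(x) = k / (x * (a + x**(gamma - 1)))."""
    return k / (x * (a + x ** (gamma - 1.0)))

def log_log_slope(x, a, gamma, h=1e-4):
    """Numerical local slope d ln g / d ln x, via a centered difference in ln x."""
    up = math.log(g_double_power_law(x * math.exp(h), a, gamma))
    down = math.log(g_double_power_law(x * math.exp(-h), a, gamma))
    return (up - down) / (2.0 * h)

def crossover_frequency(L, a, gamma):
    """Transition point n_a = a**(1 / (gamma - 1)) * L between the two power-law regimes."""
    return a ** (1.0 / (gamma - 1.0)) * L

# Parameters quoted for the lemmatized Artamène (a, gamma from Table II; L from Table I).
a, gamma, L = 4.65e-4, 1.807, 1_737_556
print(crossover_frequency(L, a, gamma))
print(log_log_slope(1e-12, a, gamma), log_log_slope(1.0, a, gamma))
```

The local slope approaches $-1$ for relative frequencies far below $a^{1/(\gamma-1)}$ and approaches $-\gamma$ far above it, which is the double power-law behavior described in the text.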



Regarding the low-frequency exponent, one could find a better fit if the exponent were not
fixed to be one; however, our data do not allow us to constrain this value well. A more important
point is the influence of lemmatization errors on the characteristics of the low-frequency regime.

[Figure 3 graphic not reproduced: six log-log panels for the lemmatized versions of Clarissa (English),
Bragelonne (French), Artamène (French), Ulysses (English), Regenta (Spanish), and Quijote (Spanish),
with the fitted scaling function g(x) overlaid on the rescaled data.]

Figure 3: Same rescaled distributions as in the previous figure ($L V_L D_L(n)$ versus $n/L$), but for the frequencies
of lemmas. The data collapse guarantees the fulfillment of the scaling law also in this case. The fit resulting
from the double power-law distribution, Eq. (2), is also included.

Title             n_a ± σ_{n_a}   γ ± σ_γ         a ± σ_a
Artamène^(l)      129.7 ± 12.6    1.807 ± 0.026   (4.65 ± 0.91) · 10^-4
Clarissa^(l)      32.70 ± 2.17    1.864 ± 0.021   (1.40 ± 0.24) · 10^-4
Don Quijote^(l)   7.91 ± 0.75     1.827 ± 0.020   (1.35 ± 0.22) · 10^-4
La Regenta^(l)    9.45 ± 0.66     1.983 ± 0.021   (3.68 ± 0.62) · 10^-5
Bragelonne^(l)    14.56 ± 1.23    1.866 ± 0.018   (9.10 ± 1.37) · 10^-5
Moby-Dick^(l)     8.21 ± 0.53     2.050 ± 0.024   (2.42 ± 0.47) · 10^-5
Ulysses^(l)       5.38 ± 0.31     2.020 ± 0.017   (1.79 ± 0.28) · 10^-5

Table II: Values of the parameters $n_a$, $\gamma$, and $a$ for the lemmatized versions (indicated with the superscript
$l$) of the seven complete books. The fits are performed numerically through maximum likelihood estimation,
while the standard deviations come from Monte Carlo simulations, see Appendix B.

Although the tools we use are rather accurate, rare words are likely to be assigned a wrong lemma. 
This limitation is intrinsic to current computational tools and has to be considered as a part of 
the lemmatization process. Nevertheless, the fact that the behavior at low frequencies is robust 
against a large variation in the percentage of lemmatization errors implies that our result is a
genuine consequence of the lemmatization. See Appendix A for more details.



IV. AN ASYMPTOTIC APPROXIMATION OF HEAPS' LAW 

Coming back to our scaling ansatz, Eq. (1), we notice that the normalization of $D_L(n)$ will
allow us to establish a relationship between the word-frequency distribution and the growth of the
vocabulary with text length. In the continuous approximation,

$$ 1 = \int_1^{\infty} D_L(n)\, dn = \frac{1}{V_L} \int_1^{\infty} g\!\left(\frac{n}{L}\right) \frac{dn}{L} = \frac{1}{V_L} \int_{1/L}^{\infty} g(x)\, dx = \frac{1}{V_L}\, G\!\left(\frac{1}{L}\right), $$

where we have used the previous relation $g(x) = -G'(x)$, and have additionally imposed $G(\infty) = 0$,
for which it is necessary that $g(x)$ decays faster than a power law with exponent one. So,

$$ V_L = G\!\left(\frac{1}{L}\right). \qquad (4) $$

This just means that the number of words with relative frequency greater than or equal to $1/L$ is
the vocabulary size $V_L$, as this is the largest rank for a text of length $L$. It is important to notice
the difference between saying that $G_L(1/L) = V_L$, which is a trivial statement, and stating that
$G(1/L) = V_L$, which provides a link between Zipf's and Heaps' law, or more generally, between the
distribution of frequencies and the vocabulary growth, by approximating the latter by the former.
The quality of such an approximation will depend, of course, on the goodness of the scale-invariance
approximation. In the usual case of a power-law distribution of frequencies extending to the lowest
values, $g(x) \propto 1/x^{\gamma}$, with $\gamma > 1$, then $G(x) \propto 1/x^{\gamma-1}$, which turns into Heaps' law, $V_L \propto L^{\alpha}$, with
$\alpha = \gamma - 1$, in agreement with previous works.

However, this power-law growth of $V_L$ with $L$ is not what is observed in texts. Due to the
accurate fit that we can achieve for lemmatized texts, we can explicitly derive an asymptotic
expression for $V_L$ given our proposal for $g(x)$. As we have just shown, $g(x)$ is not normalized to
one; rather, $\int_{1/L}^{\infty} g(x)\, dx = V_L$. Hence, substituting $g(x)$ from Eq. (2) and integrating,

$$ V_L = \frac{k}{a(\gamma-1)} \ln\!\left(1 + a L^{\gamma-1}\right). \qquad (5) $$

In this case $V_L$ is not a power law, and behaves asymptotically as $\propto \ln L$. This is a direct
consequence of our choice of the exponent 1 in the left tail of $g(x)$. Indeed, it seems clear that the
vocabulary growth curve greatly deviates from a straight line in log-log space, for it displays a
prominent convexity, see Fig. 4 as an example. Nevertheless, the result from Eq. (5) is not a good
fit either, due to a wrong proportionality constant. This is caused by the continuous approximation
in Eq. (5).
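
As a sanity check on the continuous approximation, the integral of the double power-law $g(x)$ of Eq. (2) from $1/L$ to infinity can be evaluated both in closed form, $\frac{k}{a(\gamma-1)} \ln(1 + a L^{\gamma-1})$, and numerically. The sketch below is illustrative only: the function names, the integration cutoff `t_max`, and the parameter values in the example are assumptions, not values from the paper.

```python
import math

def vocabulary_continuous(L, a, gamma, k=1.0):
    """Closed form of the integral of k / (x * (a + x**(gamma-1))) from 1/L to infinity."""
    return k / (a * (gamma - 1.0)) * math.log(1.0 + a * L ** (gamma - 1.0))

def vocabulary_numeric(L, a, gamma, k=1.0, t_max=60.0, steps=20000):
    """Trapezoidal estimate of the same integral after the substitution x = exp(t)."""
    t_min = -math.log(L)
    h = (t_max - t_min) / steps

    def integrand(t):
        # g(x) dx with x = exp(t) becomes k / (a + exp(t * (gamma - 1))) dt
        return k / (a + math.exp(t * (gamma - 1.0)))

    total = 0.5 * (integrand(t_min) + integrand(t_max))
    for i in range(1, steps):
        total += integrand(t_min + i * h)
    return total * h

# Illustrative parameter values (assumptions), of the same order as those in Table II.
a, gamma = 1e-4, 1.85
for L in (1e4, 1e5, 1e6):
    print(L, vocabulary_continuous(L, a, gamma), vocabulary_numeric(L, a, gamma))
```

The two columns should agree closely, and the growth with $L$ becomes logarithmic once $a L^{\gamma-1} \gg 1$, rather than following a power law.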

For an accurate calculation of $V_L$ we must treat our variables as discrete and compute discrete
sums rather than integrals. In the exact, discrete treatment of $D_L(n)$, equation (5) must be
rewritten as

$$ V_L = G\!\left(\frac{1}{L}\right) = V_{L_{tot}}\, S_{L_{tot}}\!\left(\frac{L_{tot}}{L}\right) = V_{L_{tot}} \sum_{n \ge L_{tot}/L} D_{L_{tot}}(n), \qquad (6) $$

where we have used that $S_{L_{tot}}(n') = \sum_{n \ge n'} D_{L_{tot}}(n)$ (notice that in the discrete case, $g(x) \neq -G'(x)$).
This is consistent with the fact that, indeed, the maximum likelihood parameters $\gamma$ and
$a$ have been computed assuming a discrete probability function (see Appendix B), and so has the
normalization constant. We would like to stress that no fit is performed in Figure 4, that is, the
constant $k$ in $g(x)$ is directly derived from the normalizing constant of $D_L(n)$, and depends only
on $\gamma$ and $a$.

Figure 4: The actual curve $V_L$ (solid black with triangles) for the lemmatized version of the book Artamène,
together with the curves $V_L = G(1/L)$ obtained by using the empirical inverse of the rank-frequency plot,
$r = G(n/L)$, with $L_i = L_{tot}/10^{(6-i)/5}$ (colors), and the analytical expression Eq. (6) with parameters
determined from the fit of $D_{L_{tot}}(n)$, Eq. (3) (dashed black).
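
The empirical version of the relation $V_L = G(1/L)$ needs only the word counts of the complete text: under the scaling hypothesis, the vocabulary expected after $L$ tokens is the number of types whose relative frequency in the whole text is at least $1/L$, that is, whose absolute frequency is at least $L_{tot}/L$. The sketch below illustrates this idea; `predicted_vocabulary` is a hypothetical name and the toy text is an assumption.

```python
from collections import Counter

def predicted_vocabulary(tokens, L):
    """Estimate V_L as the number of types with full-text frequency n >= L_tot / L."""
    L_tot = len(tokens)
    counts = Counter(tokens)
    threshold = L_tot / L
    return sum(1 for n in counts.values() if n >= threshold)

# Toy text (hypothetical) with frequencies 8, 4, 2, 1, 1 and L_tot = 16.
tokens = ["a"] * 8 + ["b"] * 4 + ["c"] * 2 + ["d", "e"]
for L in (4, 8, 16):
    print(L, predicted_vocabulary(tokens, L))
```

For a real book this prediction can be compared with the measured vocabulary growth curve; at $L = L_{tot}$ the two coincide by construction, since every type has $n \ge 1$.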

V. CONCLUSIONS 



23 



28 



2i, 



In summary, we demonstrate that, contrary to what is claimed in previous works 
Zipf's law in linguistics is extraordinarily stable under changes in the size of the analyzed text. 
A scaling function g{x) provides a constant shape for the distribution of frequencies of each 
text, D L {n), no matter its length L, which only enters into the distribution as a scale parameter 
and determining the size of the vocabulary Vl- The apparent size-dependent exponent found 
previously seems to be an artifact of the slight convexity of g(x) in a log-log plot, which is 
more clearly observed for very small values of x, accessible only for the largest text lenghts. 
Moreover, we find that in the case of lemmatized texts the distribution can be well described by 
a double power-law behavior with a large-frequency exponent 7 that does not depend on L, and 
a transition point n a that scales linearly with L. The small-frequency exponent is different than 
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the one reported in Ref. 



for a non-lemmatized corpus. Further, the stability of the shape of 
the frequency distribution allows one to predict the growth of vocabulary size with text length, 
resulting in a generalization of the popular Heaps' law. The robustness of Zipf-like parameters 
under changes in system size opens the way to more practical applications of word statistics. In 
particular, we provide a consistent way to compare statistical properties of texts with different 
lengths 



371 ] . Another interesting issue would be the application of the same scaling methods to 
other fields in which Zipf's law has been proposed to hold, as economics and demography, for 
instance. 



Appendix A: Lemmatization 

To analyze the distribution of frequencies of lemmas, we needed to have the texts lemmatized.
Manually lemmatizing the words would have exceeded the possibilities of this project, so we
proceeded to automatic processing with standard computational tools: FreeLing for Spanish
and English and TreeTagger [39] for French. The tools carry out the following steps:

1. Tokenization: Segmentation of the texts into sentences and sentences into words (tokens). 

2. Morphological analysis: Assignment of one or more lemmas and morphological information
(tag) to each token. For instance, "found" in English can correspond to the past tense of the
verb "find" or to the base form of the verb "found". At this stage, both are assigned whenever
the word form "found" is encountered.

3. Morphological disambiguation: An automatic tagger assigns the single most probable lemma
and tag to each word form, depending on the context. For instance, in "I found the keys" the
tagger would assign the lemma "find" to the word "found", while in "He promised to found a
hospital", the lemma "found" would be preferred.
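
The three steps can be illustrated with a deliberately tiny sketch. It is not the FreeLing or
TreeTagger interface: the lexicon, the tag names and the function names are all hypothetical, and
a single hand-written context rule stands in for the statistical tagger of step 3. Words missing
from the toy lexicon are returned as None.

```python
import re

# Hypothetical toy lexicon: word form -> candidate (lemma, tag) analyses.
LEXICON = {
    "i": [("I", "PRON")],
    "found": [("find", "VERB_PAST"), ("found", "VERB_BASE")],
    "the": [("the", "DET")],
    "keys": [("key", "NOUN")],
    "he": [("he", "PRON")],
    "promised": [("promise", "VERB_PAST")],
    "to": [("to", "PART")],
    "a": [("a", "DET")],
    "hospital": [("hospital", "NOUN")],
}


def tokenize(text):
    """Step 1: split text into sentences, and sentences into lowercase tokens."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [re.findall(r"[A-Za-z']+", s.lower()) for s in sentences]


def analyze(token):
    """Step 2: return every (lemma, tag) candidate listed for the word form."""
    return LEXICON.get(token, [])


def disambiguate(tokens):
    """Step 3: pick one lemma per token using a single toy context rule."""
    lemmas = []
    for i, token in enumerate(tokens):
        candidates = analyze(token)
        if not candidates:
            lemmas.append(None)  # word form not covered by the lexicon
            continue
        choice = candidates[0]
        # Toy rule (an assumption): after "to", prefer a base-form verb reading.
        if i > 0 and tokens[i - 1] == "to":
            for candidate in candidates:
                if candidate[1] == "VERB_BASE":
                    choice = candidate
        lemmas.append(choice[0])
    return lemmas


def lemmatize(text):
    """Run the three steps and return one list of lemmas per sentence."""
    return [disambiguate(sentence) for sentence in tokenize(text)]
```

Applied to the two example sentences, the sketch assigns the lemma "find" in the first sentence
and "found" in the second, because only the second occurrence follows "to".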

All these steps are automatic, so errors are introduced at each step. However, the
accuracy of the tools is quite high (e.g., around 95-97% at the token level for morphological
disambiguation), so a quantitative analysis based on the results of the automatic process can
be carried out. Also note that step 2 is based on a pre-existing dictionary (of words, not of
lemmas, also called a lexicon): only the words that are in the dictionary are assigned a reliable
set of morphological tags and lemmas. Although most of the tools used heuristically assign tag
and/or lemma information to words that are not in the dictionary, we only count tokens of lemmas
for which the corresponding word types are found in the dictionary, so as to minimize the amount
of error introduced by the automatic processing. This comes at the expense of losing some data.
However, the dictionaries have quite good coverage of the vocabulary, particularly at the token
level, but also at the type level (see Table III). The exceptions are Ulysses, because of the
stream-of-consciousness prose, which uses many non-standard word forms, and Artamène, because
17th-century French contains many word forms that a dictionary of modern French does not include.
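
As a minimal sketch of how the two coverage figures differ, the following computes coverage at
the type level and at the token level for a tokenized text. The function name `coverage` and the
argument `lexicon` are illustrative assumptions, not part of the tools used in this work.

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a token list by a lexicon.

    Type coverage counts each distinct word form once; token coverage
    weights each word form by its number of occurrences in the text.
    """
    counts = Counter(tokens)
    known = [word for word in counts if word in lexicon]
    type_coverage = len(known) / len(counts)
    token_coverage = sum(counts[word] for word in known) / sum(counts.values())
    return type_coverage, token_coverage
```

Because the word forms missing from a dictionary tend to be rare ones, token coverage is usually
the larger of the two. For the tokens of "the cat saw the dog and the zzyx" and a lexicon that
lacks only "zzyx", the type coverage is 5/6 while the token coverage is 7/8.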

Table III: Coverage of the vocabulary by the dictionary in each language, both at the type and at the token 
level. Remember that we distinguish between a word type (corresponding to its orthographic form) and its 
tokens (actual occurrences in text). 



title            types     tokens
Clarissa         68.0 %    96.9 %
Moby-Dick        70.8 %    94.7 %
Ulysses          58.6 %    90.4 %
Don Quijote      81.3 %    97.0 %
La Regenta       89.5 %    97.9 %
Artamène         43.6 %    83.6 %
Bragelonne       89.8 %    97.5 %
Seitsemän v.     89.8 %    95.4 %
Kevät ja t.      96.2 %    98.3 %
Vanhempieni r.   96.5 %    98.5 %
average          78.4 %    95.0 %



Appendix B: Maximum likelihood fitting 

The fitted values of Table II have been obtained by maximum-likelihood estimation (MLE).
This well-known procedure consists firstly in computing the log-likelihood function $\mathcal{L}$,
which in our case reads

\[
\mathcal{L} = \ln K - \frac{1}{V_L} \sum_{i=1}^{V_L} \ln n_i
            - \frac{1}{V_L} \sum_{i=1}^{V_L} \ln\left( b + n_i^{\gamma-1} \right),
\]

with $n_i$ the $V_L$ values of the frequency and the normalization constant $K$ in the discrete
case equal to

\[
K = \left[ \sum_{n=1}^{\infty} \frac{1}{n \left( b + n^{\gamma-1} \right)} \right]^{-1}.
\]

Note that we have reparameterized the distribution with respect to the main text, introducing
$b = n_a^{\gamma-1} = a L^{\gamma-1}$. Then, $\mathcal{L}$ is maximized with respect to the parameters
$\gamma$ and $b$; this has been done numerically using the simplex method [40]. The error terms
$\sigma_\gamma$ and $\sigma_b$, representing the standard deviation of each estimator, are computed
from Monte Carlo simulations: from the resulting maximum-likelihood parameters $\gamma^*$ and $b^*$,
synthetic data samples are simulated, and the MLE parameters of these samples are calculated in
the same way; their fluctuations yield $\sigma_\gamma$ and $\sigma_b$. We stress that no continuous
approximation has been made, that is, the simulated data follow the discrete probability function
$D_L(n)$ (this is done using the rejection method; see Refs. [3, 41] for details for a similar
case). In a summarized recipe, the procedure simply is:

1. Numerically compute the MLE parameters, $\gamma^*$ and $b^*$.

2. Draw $M$ datasets, each of size $V_L$, from the discrete probability function $D_L(n; \gamma^*, b^*)$.

3. For each dataset $m = 1, \dots, M$, compute the MLE parameters $\gamma_m$, $b_m$.

4. Compute the standard deviations $\sigma_\gamma$ and $\sigma_b$ of the sets $\{\gamma_m\}_{m=1}^{M}$ and $\{b_m\}_{m=1}^{M}$.

The standard deviations of $n_a$ and $a$ are computed in the same way using their relationship with
$b$ and $\gamma$.
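
The recipe can be sketched in Python as follows. This is an illustration under stated
assumptions, not the code used for the results reported here. It assumes the discrete double power
law takes the form $D_L(n) = K / [n (b + n^{\gamma-1})]$; it truncates the normalization sum to a
finite `support` array supplied by the caller; and, for brevity, it draws synthetic data by
inverse-CDF sampling over that support rather than by the rejection method. All function names
are hypothetical. The simplex search uses SciPy's Nelder-Mead implementation.

```python
import numpy as np
from scipy.optimize import minimize


def log_shape(n, gamma, b):
    """Log of the unnormalized double power law 1 / (n * (b + n**(gamma - 1)))."""
    return -np.log(n) - np.log(b + n ** (gamma - 1.0))


def neg_log_likelihood(params, data, support):
    """Negative mean log-likelihood; K is summed over the (assumed) finite support."""
    gamma, b = params
    if gamma <= 1.0 or b <= 0.0:
        return np.inf  # keep the search inside the admissible region
    log_k = -np.log(np.sum(np.exp(log_shape(support, gamma, b))))
    return -(log_k + np.mean(log_shape(data, gamma, b)))


def fit(data, support, start=(2.0, 1.0)):
    """Step 1: maximize the likelihood with the Nelder-Mead simplex method."""
    result = minimize(neg_log_likelihood, start, args=(data, support),
                      method="Nelder-Mead")
    return result.x  # array [gamma, b]


def sample(gamma, b, size, support, rng):
    """Draw discrete frequencies; inverse-CDF sampling stands in for rejection."""
    p = np.exp(log_shape(support, gamma, b))
    return rng.choice(support, size=size, p=p / p.sum())


def monte_carlo_errors(gamma_hat, b_hat, size, support, n_sim, rng):
    """Steps 2-4: refit n_sim synthetic datasets and take standard deviations."""
    fits = np.array([
        fit(sample(gamma_hat, b_hat, size, support, rng), support,
            start=(gamma_hat, b_hat))
        for _ in range(n_sim)
    ])
    gammas, bs = fits[:, 0], fits[:, 1]
    n_a = bs ** (1.0 / (gammas - 1.0))  # from b = n_a**(gamma - 1)
    return gammas.std(ddof=1), bs.std(ddof=1), n_a.std(ddof=1)
```

Given an array `data` of word frequencies, a call such as
`fit(data, np.arange(1, 10**6 + 1, dtype=float))` would return the estimates of $\gamma$ and $b$,
and `monte_carlo_errors` would then return the standard deviations of $\gamma$, $b$ and $n_a$. The
upper end of `support` should be chosen well above the largest observed frequency, since the
truncation of the sum is the one approximation this sketch introduces.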